In [1]:
DATA_FOLDER = 'Data' # Use the data folder provided in Tutorial 02 - Intro to Pandas.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from dateutil.parser import parse
from os import listdir
from os.path import isfile, join
sns.set_context('notebook')

Task 1. Compiling Ebola Data

The DATA_FOLDER/ebola folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single Dataframe. Using this DataFrame, calculate for each country, the daily average per month of new cases and deaths. Make sure you handle all the different expressions for new cases and deaths that are used in the reports.

First, we define some helper functions for parsing the data.

  • get_files: returns all the .csv files for a given country

In [2]:
def get_files(country):
    path = DATA_FOLDER + "/ebola/" + country + "_data/"
    return [f for f in listdir(path) if isfile(join(path, f))]
  • sum_row: for a given row, returns the total value of new cases / deaths. We first defined this function as the sum of new cases / deaths over all provinces, but we found inconsistent data for some provinces, so we decided to use only the 'total' column
  • sum_rows: sum all the rows given in argument

In [3]:
def sum_row(row, total_col):
    return float(row[total_col].values[0])

def sum_rows(rows, total_col):
    tot = 0
    for row in rows:
        tot += sum_row(row, total_col)
    return tot
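As a quick sanity check, the same helpers can be exercised on a toy report frame (hypothetical values); sum_rows can equivalently be written with a generator expression:

```python
import pandas as pd

# Toy report frame with a 'Totals' column (hypothetical values)
toy = pd.DataFrame({"Description": ["New cases of confirmed"], "Totals": [12]})

def sum_row(row, total_col):
    # row is a one-row sub-frame; take its single total value
    return float(row[total_col].values[0])

def sum_rows(rows, total_col):
    return sum(sum_row(r, total_col) for r in rows)

print(sum_rows([toy], "Totals"))  # 12.0
```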

Now, we define for each country a function which, for a given file, returns a dictionary with the country, date, upper and lower bounds for the new cases, and upper and lower bounds for the new deaths. As we don't know whether the new cases / deaths for the 'probable' and 'suspected' cases are reliable, we decided to create an upper bound with the sum of the 'confirmed', 'probable' and 'suspected' new cases / deaths, and a lower bound with only the 'confirmed' new cases / deaths.

The structure of these functions is the same for each country; only the wording of the data descriptions changes.


In [4]:
def get_row_guinea(file):
    country = 'guinea'
    date = file[:10]
    raw = pd.read_csv(DATA_FOLDER + "/ebola/" + country + "_data/" + file)
    total_col = "Totals"
    
    new_cases_lower = sum_row(raw[raw.Description == "New cases of confirmed"], total_col)
    new_cases_upper = sum_row(raw[raw.Description == "Total new cases registered so far"], total_col)
        
    new_deaths_lower = sum_row(raw[(raw.Description == "New deaths registered today (confirmed)") | (raw.Description == "New deaths registered")], total_col)
    new_deaths_upper = sum_row(raw[(raw.Description == "New deaths registered today") | (raw.Description == "New deaths registered")], total_col)
    
    return {'Country' : country, 'Date' : parse(date), 'NewCasesLower' : new_cases_lower, 'NewCasesUpper' : new_cases_upper, 'NewDeathsLower' : new_deaths_lower, 'NewDeathsUpper' : new_deaths_upper}

In [5]:
def get_row_liberia(file):
    country = 'liberia'
    date = file[:10]
    raw = pd.read_csv(DATA_FOLDER + "/ebola/" + country + "_data/" + file).fillna(0)
    total_col = "National"
    
    new_cases_lower = sum_row(raw[raw.Variable == "New case/s (confirmed)"], total_col)
    list_cases_upper = (["New Case/s (Suspected)", 
                        "New Case/s (Probable)",
                        "New case/s (confirmed)"])
    new_cases_upper = sum_rows([raw[raw.Variable == row] for row in list_cases_upper], total_col)
        
    new_deaths_lower = sum_row(raw[raw.Variable == "Newly reported deaths"], total_col)
    new_deaths_upper = new_deaths_lower
    
    return {'Country' : country, 'Date' : parse(date), 'NewCasesLower' : new_cases_lower, 'NewCasesUpper' : new_cases_upper, 'NewDeathsLower' : new_deaths_lower, 'NewDeathsUpper' : new_deaths_upper}

As the files for Sierra Leone do not contain data for the new deaths, we first extract the total deaths for each day, and we will process them later to get the new deaths.


In [6]:
def get_row_sl(file):
    country = 'sl'
    date = file[:10]
    raw = pd.read_csv(DATA_FOLDER + "/ebola/" + country + "_data/" + file).fillna(0)
    total_col = "National"
        
    new_cases_lower = sum_row(raw[raw.variable == "new_confirmed"], total_col)
    list_cases_upper = (["new_suspected", 
                        "new_probable",
                        "new_confirmed"])
    new_cases_upper = sum_rows([raw[raw.variable == row] for row in list_cases_upper], total_col)
    
    list_death_upper = (["death_suspected", 
                        "death_probable",
                        "death_confirmed"])
    total_death_upper = sum_rows([raw[raw.variable == row] for row in list_death_upper], total_col)
    total_death_lower = sum_row(raw[raw.variable == "death_confirmed"], total_col)
    
    return {'Country' : country, 'Date' : parse(date), 'NewCasesLower' : new_cases_lower, 'NewCasesUpper' : new_cases_upper, 'TotalDeathLower' : total_death_lower, 'TotalDeathUpper' : total_death_upper}

In [7]:
rows_guinea = [get_row_guinea(file) for file in get_files("guinea")]

In [8]:
rows_liberia = [get_row_liberia(file) for file in get_files("liberia")]

We now transform the data for Sierra Leone:

  • we first create a new dictionary keyed by date, whose values are the entries previously extracted from the .csv files.
  • then, for each entry in this dictionary, we look up the entry of the previous day and take the difference to obtain that day's new deaths.

In [9]:
rows_sl_total_deaths = [get_row_sl(file) for file in get_files("sl")]
dic_sl_total_deaths = {}
for row in rows_sl_total_deaths:
    dic_sl_total_deaths[row['Date']] = row
    
rows_sl = []
for date, entry in dic_sl_total_deaths.items():
    date_before = date - datetime.timedelta(days=1)
    if date_before in dic_sl_total_deaths:
        
        if entry['TotalDeathUpper'] != 0 and dic_sl_total_deaths[date_before]['TotalDeathUpper'] != 0 and entry['TotalDeathLower'] != 0 and dic_sl_total_deaths[date_before]['TotalDeathLower'] != 0: 
            copy = dict(entry)
            del copy['TotalDeathUpper']
            del copy['TotalDeathLower']
            
            copy['NewDeathsUpper'] = entry['TotalDeathUpper'] - dic_sl_total_deaths[date_before]['TotalDeathUpper']
            copy['NewDeathsLower'] = entry['TotalDeathLower'] - dic_sl_total_deaths[date_before]['TotalDeathLower']

            rows_sl.append(copy)

We can now insert the data into a dataframe. For Liberia, December's data is in a completely different format, so we dropped it: for some days the reported new cases are the day's new cases, while for others they are the country's cumulative totals.


In [10]:
raw_dataframe = pd.DataFrame(columns=['Country', 'Date', 'NewCasesLower', 'NewCasesUpper', 'NewDeathsLower', 'NewDeathsUpper'])
for rows in (rows_sl, rows_guinea):
    raw_dataframe = raw_dataframe.append(rows, ignore_index = True)
for row in rows_liberia:
    if row['Date'].month != 12: #December data is erroneous
        raw_dataframe = raw_dataframe.append(row, ignore_index = True)
        
raw_dataframe


Out[10]:
Country Date NewCasesLower NewCasesUpper NewDeathsLower NewDeathsUpper
0 sl 2014-10-07 93.0 118.0 5.0 5.0
1 sl 2014-10-21 50.0 55.0 4.0 4.0
2 sl 2014-11-08 111.0 131.0 7.0 7.0
3 sl 2014-09-02 31.0 35.0 3.0 3.0
4 sl 2014-10-06 40.0 40.0 4.0 4.0
5 sl 2014-08-16 18.0 21.0 10.0 10.0
6 sl 2014-11-18 43.0 43.0 17.0 17.0
7 sl 2014-10-25 70.0 87.0 11.0 42.0
8 sl 2014-09-10 26.0 31.0 9.0 9.0
9 sl 2014-09-08 17.0 19.0 5.0 5.0
10 sl 2014-10-22 44.0 55.0 7.0 7.0
11 sl 2014-08-22 56.0 57.0 4.0 4.0
12 sl 2014-09-24 75.0 82.0 0.0 0.0
13 sl 2014-11-16 89.0 89.0 17.0 17.0
14 sl 2014-09-16 29.0 34.0 9.0 9.0
15 sl 2014-09-18 33.0 41.0 6.0 6.0
16 sl 2014-09-27 56.0 59.0 5.0 5.0
17 sl 2014-11-01 60.0 75.0 7.0 7.0
18 sl 2014-09-25 35.0 36.0 4.0 4.0
19 sl 2014-09-01 9.0 11.0 9.0 9.0
20 sl 2014-11-24 83.0 115.0 33.0 33.0
21 sl 2014-10-01 33.0 43.0 -18.0 -18.0
22 sl 2014-08-23 23.0 27.0 3.0 6.0
23 sl 2014-10-13 45.0 47.0 5.0 15.0
24 sl 2014-10-20 72.0 80.0 11.0 11.0
25 sl 2014-09-26 23.0 32.0 0.0 0.0
26 sl 2014-10-11 94.0 125.0 17.0 17.0
27 sl 2014-11-17 53.0 53.0 0.0 0.0
28 sl 2014-11-29 64.0 88.0 15.0 15.0
29 sl 2014-10-23 60.0 67.0 4.0 18.0
... ... ... ... ... ... ...
161 liberia 2014-10-28 0.0 55.0 30.0 30.0
162 liberia 2014-07-10 0.0 3.0 4.0 4.0
163 liberia 2014-06-22 5.0 10.0 4.0 4.0
164 liberia 2014-10-11 0.0 31.0 33.0 33.0
165 liberia 2014-06-28 1.0 9.0 1.0 1.0
166 liberia 2014-11-19 0.0 36.0 21.0 21.0
167 liberia 2014-10-24 0.0 35.0 40.0 40.0
168 liberia 2014-07-07 5.0 7.0 2.0 2.0
169 liberia 2014-09-28 2.0 24.0 4.0 4.0
170 liberia 2014-10-03 2.0 58.0 38.0 38.0
171 liberia 2014-06-24 4.0 6.0 4.0 4.0
172 liberia 2014-08-28 22.0 61.0 51.0 51.0
173 liberia 2014-07-17 0.0 10.0 3.0 3.0
174 liberia 2014-09-08 12.0 65.0 47.0 47.0
175 liberia 2014-06-29 2.0 2.0 0.0 0.0
176 liberia 2014-10-09 3.0 60.0 45.0 45.0
177 liberia 2014-11-15 0.0 39.0 12.0 12.0
178 liberia 2014-07-20 1.0 3.0 4.0 4.0
179 liberia 2014-08-17 0.0 24.0 11.0 11.0
180 liberia 2014-11-23 0.0 7.0 0.0 0.0
181 liberia 2014-11-24 0.0 25.0 15.0 15.0
182 liberia 2014-06-17 0.0 2.0 0.0 0.0
183 liberia 2014-08-12 2.0 60.0 29.0 29.0
184 liberia 2014-09-16 4.0 99.0 54.0 54.0
185 liberia 2014-09-02 3.0 76.0 47.0 47.0
186 liberia 2014-09-23 16.0 88.0 51.0 51.0
187 liberia 2014-10-25 1.0 31.0 24.0 24.0
188 liberia 2014-10-21 0.0 28.0 23.0 23.0
189 liberia 2014-11-20 0.0 40.0 30.0 30.0
190 liberia 2014-08-04 1.0 10.0 7.0 7.0

191 rows × 6 columns
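Note that DataFrame.append was deprecated and later removed in pandas 2.0; on newer versions the same frame can be built in one step from the lists of row dicts. A sketch with a hypothetical row:

```python
import pandas as pd

cols = ['Country', 'Date', 'NewCasesLower', 'NewCasesUpper',
        'NewDeathsLower', 'NewDeathsUpper']

# Hypothetical row dicts standing in for the per-country lists
rows = [
    {'Country': 'guinea', 'Date': pd.Timestamp('2014-08-04'),
     'NewCasesLower': 4.0, 'NewCasesUpper': 11.0,
     'NewDeathsLower': 2.0, 'NewDeathsUpper': 2.0},
]

# Build the frame in one call instead of appending row by row
raw_dataframe = pd.DataFrame(rows, columns=cols)
```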


In [11]:
dataframe = raw_dataframe.set_index(['Country', 'Date'])

In [12]:
dataframe_no_day = raw_dataframe
dataframe_no_day['Year'] = raw_dataframe['Date'].apply(lambda x: x.year)
dataframe_no_day['Month'] = raw_dataframe['Date'].apply(lambda x: x.month)
final_df = dataframe_no_day[['Country', 'Year', 'Month', 'NewCasesLower', 'NewCasesUpper', 'NewDeathsLower', 'NewDeathsUpper']].groupby(['Country', 'Year', 'Month']).mean()
final_df


Out[12]:
NewCasesLower NewCasesUpper NewDeathsLower NewDeathsUpper
Country Year Month
guinea 2014 8 12.400000 25.800000 3.400000 3.400000
9 13.000000 19.625000 3.562500 3.562500
10 6.000000 34.000000 15.000000 15.000000
liberia 2014 6 2.142857 5.714286 2.000000 2.000000
7 1.818182 8.545455 4.272727 4.272727
8 5.444444 37.222222 23.222222 23.222222
9 6.166667 63.833333 36.041667 36.041667
10 1.360000 45.560000 28.040000 28.040000
11 2.600000 26.466667 13.466667 13.466667
sl 2014 8 19.266667 26.666667 6.400000 6.800000
9 37.360000 44.160000 5.120000 5.120000
10 59.956522 73.652174 13.086957 23.391304
11 63.846154 78.153846 16.076923 16.076923
12 27.000000 39.000000 30.000000 30.000000
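The year/month extraction above can also be done with the vectorized .dt accessor instead of apply; a sketch on hypothetical daily rows:

```python
import pandas as pd

# Hypothetical daily rows for one country
df = pd.DataFrame({
    'Country': ['guinea', 'guinea'],
    'Date': pd.to_datetime(['2014-08-01', '2014-08-15']),
    'NewCasesLower': [10.0, 14.0],
})
df['Year'] = df['Date'].dt.year    # vectorized, no per-row lambda
df['Month'] = df['Date'].dt.month
monthly = df.groupby(['Country', 'Year', 'Month'])['NewCasesLower'].mean()
print(monthly.loc[('guinea', 2014, 8)])  # 12.0
```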

Finally, to get a rough overall picture of the data, we average the lower and upper bounds.


In [13]:
s1 = final_df[['NewCasesLower', 'NewCasesUpper']].mean(axis=1)
s2 = final_df[['NewDeathsLower', 'NewDeathsUpper']].mean(axis=1)
final = pd.concat([s1, s2], axis=1)
final.columns = ['NewCasesAverage', 'NewDeathsAverage']
final


Out[13]:
NewCasesAverage NewDeathsAverage
Country Year Month
guinea 2014 8 19.100000 3.400000
9 16.312500 3.562500
10 20.000000 15.000000
liberia 2014 6 3.928571 2.000000
7 5.181818 4.272727
8 21.333333 23.222222
9 35.000000 36.041667
10 23.460000 28.040000
11 14.533333 13.466667
sl 2014 8 22.966667 6.600000
9 40.760000 5.120000
10 66.804348 18.239130
11 71.000000 16.076923
12 33.000000 30.000000

Task 2. RNA Sequences

In the DATA_FOLDER/microbiome subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10th file that describes the content of each.

Use pandas to import the first 9 spreadsheets into a single DataFrame. Then, add the metadata information from the 10th spreadsheet as columns in the combined DataFrame. Make sure that the final DataFrame has a unique index and all the NaN values have been replaced by the tag unknown.

We load the first spreadsheet from the file's Sheet 1. Then we add a new column, identical for all rows of this import, containing the file's barcode.

Then we rename the columns for more clarity.


In [14]:
mid = pd.read_excel(DATA_FOLDER + '/microbiome/MID1.xls', sheetname='Sheet 1', header=None)
mid.fillna('unknown', inplace=True)
mid['BARCODE'] = 'MID1'
mid.columns = ['Taxon', 'Count', 'BARCODE']

Now we repeat this operation for every other spreadsheet except the metadata. At each iteration we simply concatenate the new data to the end of the previous data, thus accumulating all the files' data into a single dataframe. We don't care about the index for now, since we will set a meaningful one later.


In [15]:
for i in range(2, 10):
    midi = pd.read_excel(DATA_FOLDER + '/microbiome/MID' + str(i) + '.xls', sheetname='Sheet 1', header=None)
    midi.fillna('unknown', inplace=True)
    midi['BARCODE'] = 'MID' + str(i)
    midi.columns = ['Taxon', 'Count', 'BARCODE']
    mid = pd.concat([mid, midi])
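Concatenating inside the loop copies the accumulated frame on every iteration; collecting the frames in a list and concatenating once avoids that. A sketch with hypothetical stand-in frames:

```python
import pandas as pd

# Hypothetical stand-ins for the per-file spreadsheets
frames = []
for i in range(1, 4):
    f = pd.DataFrame({'Taxon': ['Archaea'], 'Count': [i]})
    f['BARCODE'] = 'MID' + str(i)
    frames.append(f)

# One concatenation at the end instead of one per iteration
mid = pd.concat(frames, ignore_index=True)
```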

Finally, we merge with the metadata, joining on the BARCODE column (which becomes the metadata's index when we import it here). We then set the index to the BARCODE and Taxon columns, which together form a unique index.

The only NaN value we found was in the metadata, which may indicate that there is no sample for the first group. We replaced it with unknown anyway.


In [16]:
metadata = pd.read_excel(DATA_FOLDER + '/microbiome/metadata.xls', sheetname='Sheet1', index_col=0)
metadata.fillna('unknown', inplace=True)
merged = pd.merge(mid, metadata, right_index=True, left_on='BARCODE')
merged = merged.set_index(keys=['BARCODE', 'Taxon'])
merged


Out[16]:
Count GROUP SAMPLE
BARCODE Taxon
MID1 Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera 7 EXTRACTION CONTROL unknown
Archaea "Crenarchaeota" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus 2 EXTRACTION CONTROL unknown
Archaea "Crenarchaeota" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus 3 EXTRACTION CONTROL unknown
Archaea "Crenarchaeota" Thermoprotei Thermoproteales Thermofilaceae Thermofilum 3 EXTRACTION CONTROL unknown
Archaea "Euryarchaeota" "Methanomicrobia" Methanocellales Methanocellaceae Methanocella 7 EXTRACTION CONTROL unknown
Archaea "Euryarchaeota" "Methanomicrobia" Methanosarcinales Methanosarcinaceae Methanimicrococcus 1 EXTRACTION CONTROL unknown
Archaea "Euryarchaeota" "Methanomicrobia" Methanosarcinales Methermicoccaceae Methermicoccus 1 EXTRACTION CONTROL unknown
Archaea "Euryarchaeota" Archaeoglobi Archaeoglobales Archaeoglobaceae Ferroglobus 1 EXTRACTION CONTROL unknown
Archaea "Euryarchaeota" Archaeoglobi Archaeoglobales Archaeoglobaceae Geoglobus 1 EXTRACTION CONTROL unknown
Archaea "Euryarchaeota" Halobacteria Halobacteriales Halobacteriaceae Haloplanus 1 EXTRACTION CONTROL unknown
Archaea "Euryarchaeota" Halobacteria Halobacteriales Halobacteriaceae Haloquadratum 4 EXTRACTION CONTROL unknown
Archaea "Euryarchaeota" Halobacteria Halobacteriales Halobacteriaceae Halosimplex 1 EXTRACTION CONTROL unknown
Archaea "Euryarchaeota" Halobacteria Halobacteriales Halobacteriaceae Natronobacterium 2 EXTRACTION CONTROL unknown
Archaea "Euryarchaeota" Halobacteria Halobacteriales Halobacteriaceae Natronomonas 4 EXTRACTION CONTROL unknown
Archaea "Euryarchaeota" Methanococci Methanococcales Methanocaldococcaceae Methanotorris 1 EXTRACTION CONTROL unknown
Archaea "Euryarchaeota" Methanopyri Methanopyrales Methanopyraceae Methanopyrus 12 EXTRACTION CONTROL unknown
Archaea "Euryarchaeota" Thermoplasmata Thermoplasmatales Thermoplasmatales_incertae_sedis Thermogymnomonas 2 EXTRACTION CONTROL unknown
Bacteria "Actinobacteria" Actinobacteria Actinomycetales Acidothermaceae Acidothermus 2 EXTRACTION CONTROL unknown
Bacteria "Actinobacteria" Actinobacteria Actinomycetales Beutenbergiaceae Salana 1 EXTRACTION CONTROL unknown
Bacteria "Actinobacteria" Actinobacteria Actinomycetales Bogoriellaceae Bogoriella 1 EXTRACTION CONTROL unknown
Bacteria "Actinobacteria" Actinobacteria Actinomycetales Cellulomonadaceae Tropheryma 1 EXTRACTION CONTROL unknown
Bacteria "Actinobacteria" Actinobacteria Actinomycetales Corynebacteriaceae Turicella 2 EXTRACTION CONTROL unknown
Bacteria "Actinobacteria" Actinobacteria Actinomycetales Cryptosporangiaceae Cryptosporangium 53 EXTRACTION CONTROL unknown
Bacteria "Actinobacteria" Actinobacteria Actinomycetales Dermabacteraceae Dermabacter 1 EXTRACTION CONTROL unknown
Bacteria "Actinobacteria" Actinobacteria Actinomycetales Dermabacteraceae Devriesea 2 EXTRACTION CONTROL unknown
Bacteria "Actinobacteria" Actinobacteria Actinomycetales Dermatophilaceae Kineosphaera 3 EXTRACTION CONTROL unknown
Bacteria "Actinobacteria" Actinobacteria Actinomycetales Frankineae_incertae_sedis Fodinicola 1 EXTRACTION CONTROL unknown
Bacteria "Actinobacteria" Actinobacteria Actinomycetales Geodermatophilaceae Blastococcus 99 EXTRACTION CONTROL unknown
Bacteria "Actinobacteria" Actinobacteria Actinomycetales Geodermatophilaceae Geodermatophilus 17 EXTRACTION CONTROL unknown
Bacteria "Actinobacteria" Actinobacteria Actinomycetales Intrasporangiaceae Kribbia 2 EXTRACTION CONTROL unknown
... ... ... ... ...
MID9 Bacteria "Proteobacteria" Gammaproteobacteria Oceanospirillales Oceanospirillaceae Neptuniibacter 10 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Oceanospirillales Oceanospirillales_incertae_sedis Spongiispira 1 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Oceanospirillales Oleiphilaceae Oleiphilus 1 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Pasteurellales Pasteurellaceae Haemophilus 1 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Pseudomonadales Moraxellaceae Alkanindiges 12 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Pseudomonadales Moraxellaceae Perlucidibaca 4 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Pseudomonadales Pseudomonadaceae Azomonas 805 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Pseudomonadales Pseudomonadaceae Azorhizophilus 862 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Pseudomonadales Pseudomonadaceae Azotobacter 12 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas 3534 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Cycloclasticus 4 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Piscirickettsiaceae Sulfurivirga 1 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Thiotrichaceae Beggiatoa 25 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Thiotrichales Thiotrichales_incertae_sedis Fangia 2 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Xanthomonadales Sinobacteraceae Alkanibacter 3 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Xanthomonadales Sinobacteraceae Sinobacter 2 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Xanthomonadales Xanthomonadaceae Aquimonas 1 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Xanthomonadales Xanthomonadaceae Frateuria 3 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Xanthomonadales Xanthomonadaceae Ignatzschineria 2 Control 2 stool
Bacteria "Proteobacteria" Gammaproteobacteria Xanthomonadales Xanthomonadaceae Luteibacter 1 Control 2 stool
Bacteria "Synergistetes" Synergistia Synergistales Synergistaceae Aminomonas 3 Control 2 stool
Bacteria "Tenericutes" Mollicutes Anaeroplasmatales Anaeroplasmataceae Asteroleplasma 1 Control 2 stool
Bacteria "Tenericutes" Mollicutes Haloplasmatales Haloplasmataceae Haloplasma 11 Control 2 stool
Bacteria "Thermodesulfobacteria" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae Caldimicrobium 1 Control 2 stool
Bacteria "Thermodesulfobacteria" Thermodesulfobacteria Thermodesulfobacteriales Thermodesulfobacteriaceae Thermodesulfatator 1 Control 2 stool
Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Geotoga 43 Control 2 stool
Bacteria "Thermotogae" Thermotogae Thermotogales Thermotogaceae Kosmotoga 16 Control 2 stool
Bacteria "Verrucomicrobia" Opitutae Opitutales Opitutaceae Opitutus 1 Control 2 stool
Bacteria Cyanobacteria Cyanobacteria Chloroplast Chlorarachniophyceae 1 Control 2 stool
Bacteria Cyanobacteria Cyanobacteria Chloroplast Streptophyta 10 Control 2 stool

2396 rows × 3 columns
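The task requires a unique index; this can be verified directly with Index.is_unique, shown here on a toy frame with the same (BARCODE, Taxon) index shape and hypothetical values:

```python
import pandas as pd

# Toy frame with a two-level (BARCODE, Taxon) index (hypothetical values)
toy = pd.DataFrame({'BARCODE': ['MID1', 'MID1', 'MID2'],
                    'Taxon': ['Archaea', 'Bacteria', 'Archaea'],
                    'Count': [1, 2, 3]}).set_index(['BARCODE', 'Taxon'])
print(toy.index.is_unique)  # True
```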

Task 3. Class War in Titanic

Use pandas to import the data file Data/titanic.xls. It contains data on all the passengers that travelled on the Titanic.


In [17]:
from IPython.core.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')


Out[17]:

Data frame: titanic3
1309 observations and 14 variables, maximum # NAs: 1188

Name       Labels                             Units              Levels  Storage    NAs
pclass                                                           3       integer    0
survived   Survived                                                      double     0
name       Name                                                          character  0
sex                                                              2       integer    0
age        Age                                Year                       double     263
sibsp      Number of Siblings/Spouses Aboard                             double     0
parch      Number of Parents/Children Aboard                             double     0
ticket     Ticket Number                                                 character  0
fare       Passenger Fare                     British Pound (£)          double     1
cabin                                                            187     integer    0
embarked                                                         3       integer    2
boat                                                             28      integer    0
body       Body Identification Number                                    double     1188
home.dest  Home/Destination                                              character  0

Variable levels:
pclass: 1st, 2nd, 3rd
sex: female, male
cabin: A10; A11; A14; A16; A18; A19; A20; A21; A23; A24; A26; A29; A31; A32; A34; A36; A5; A6; A7; A9; B10; B101; B102; B11; B18; B19; B20; B22; B24; B26; B28; B3; B30; B35; B36; B37; B38; B39; B4; B41; B42; B45; B49; B5; B50; B51 B53 B55; B52 B54 B56; B57 B59 B63 B66; B58 B60; B61; B69; B71; B73; B77; B78; B79; B80; B82 B84; B86; B94; B96 B98; C101; C103; C104; C105; C106; C110; C111; C116; C118; C123; C124; C125; C126; C128; C130; C132; C148; C2; C22 C26; C23 C25 C27; C28; C30; C31; C32; C39; C45; C46; C47; C49; C50; C51; C52; C53; C54; C55 C57; C6; C62 C64; C65; C68; C7; C70; C78; C80; C82; C83; C85; C86; C87; C89; C90; C91; C92; C93; C95; C97; C99; D; D10 D12; D11; D15; D17; D19; D20; D21; D22; D26; D28; D30; D33; D34; D35; D36; D37; D38; D40; D43; D45; D46; D47; D48; D49; D50; D56; D6; D7; D9; E10; E101; E12; E121; E17; E24; E25; E31; E33; E34; E36; E38; E39 E41; E40; E44; E45; E46; E49; E50; E52; E58; E60; E63; E67; E68; E77; E8; F; F E46; F E57; F E69; F G63; F G73; F2; F33; F38; F4; G6; T
embarked: Cherbourg, Queenstown, Southampton
boat: 1; 10; 11; 12; 13; 13 15; 13 15 B; 14; 15; 15 16; 16; 2; 3; 4; 5; 5 7; 5 9; 6; 7; 8; 8 10; 9; A; B; C; C D; D

For each of the following questions state clearly your assumptions and discuss your findings:

  1. Describe the type and the value range of each attribute. Indicate and transform the attributes that can be Categorical.
  2. Plot histograms for the travel class, embarkation port, sex and age attributes. For the latter one, use discrete decade intervals.
  3. Calculate the proportion of passengers by cabin floor. Present your results in a pie chart.
  4. For each travel class, calculate the proportion of the passengers that survived. Present your results in pie charts.
  5. Calculate the proportion of the passengers that survived by travel class and sex. Present your results in a single histogram.
  6. Create 2 equally populated age categories and calculate survival proportions by age category, travel class and sex. Present your results in a DataFrame with unique index.

1. We start by importing the data from the file.


In [18]:
titanic = pd.read_excel(DATA_FOLDER + '/titanic.xls', sheetname='titanic')
titanic


Out[18]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN 135.0 Montreal, PQ / Chesterville, ON
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
5 1 1 Anderson, Mr. Harry male 48.0000 0 0 19952 26.5500 E12 S 3 NaN New York, NY
6 1 1 Andrews, Miss. Kornelia Theodosia female 63.0000 1 0 13502 77.9583 D7 S 10 NaN Hudson, NY
7 1 0 Andrews, Mr. Thomas Jr male 39.0000 0 0 112050 0.0000 A36 S NaN NaN Belfast, NI
8 1 1 Appleton, Mrs. Edward Dale (Charlotte Lamson) female 53.0000 2 0 11769 51.4792 C101 S D NaN Bayside, Queens, NY
9 1 0 Artagaveytia, Mr. Ramon male 71.0000 0 0 PC 17609 49.5042 NaN C NaN 22.0 Montevideo, Uruguay
10 1 0 Astor, Col. John Jacob male 47.0000 1 0 PC 17757 227.5250 C62 C64 C NaN 124.0 New York, NY
11 1 1 Astor, Mrs. John Jacob (Madeleine Talmadge Force) female 18.0000 1 0 PC 17757 227.5250 C62 C64 C 4 NaN New York, NY
12 1 1 Aubart, Mme. Leontine Pauline female 24.0000 0 0 PC 17477 69.3000 B35 C 9 NaN Paris, France
13 1 1 Barber, Miss. Ellen "Nellie" female 26.0000 0 0 19877 78.8500 NaN S 6 NaN NaN
14 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0000 0 0 27042 30.0000 A23 S B NaN Hessle, Yorks
15 1 0 Baumann, Mr. John D male NaN 0 0 PC 17318 25.9250 NaN S NaN NaN New York, NY
16 1 0 Baxter, Mr. Quigg Edmond male 24.0000 0 1 PC 17558 247.5208 B58 B60 C NaN NaN Montreal, PQ
17 1 1 Baxter, Mrs. James (Helene DeLaudeniere Chaput) female 50.0000 0 1 PC 17558 247.5208 B58 B60 C 6 NaN Montreal, PQ
18 1 1 Bazzani, Miss. Albina female 32.0000 0 0 11813 76.2917 D15 C 8 NaN NaN
19 1 0 Beattie, Mr. Thomson male 36.0000 0 0 13050 75.2417 C6 C A NaN Winnipeg, MN
20 1 1 Beckwith, Mr. Richard Leonard male 37.0000 1 1 11751 52.5542 D35 S 5 NaN New York, NY
21 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0000 1 1 11751 52.5542 D35 S 5 NaN New York, NY
22 1 1 Behr, Mr. Karl Howell male 26.0000 0 0 111369 30.0000 C148 C 5 NaN New York, NY
23 1 1 Bidois, Miss. Rosalie female 42.0000 0 0 PC 17757 227.5250 NaN C 4 NaN NaN
24 1 1 Bird, Miss. Ellen female 29.0000 0 0 PC 17483 221.7792 C97 S 8 NaN NaN
25 1 0 Birnbaum, Mr. Jakob male 25.0000 0 0 13905 26.0000 NaN C NaN 148.0 San Francisco, CA
26 1 1 Bishop, Mr. Dickinson H male 25.0000 1 0 11967 91.0792 B49 C 7 NaN Dowagiac, MI
27 1 1 Bishop, Mrs. Dickinson H (Helen Walton) female 19.0000 1 0 11967 91.0792 B49 C 7 NaN Dowagiac, MI
28 1 1 Bissette, Miss. Amelia female 35.0000 0 0 PC 17760 135.6333 C99 S 8 NaN NaN
29 1 1 Bjornstrom-Steffansson, Mr. Mauritz Hakan male 28.0000 0 0 110564 26.5500 C52 S D NaN Stockholm, Sweden / Washington, DC
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1279 3 0 Vestrom, Miss. Hulda Amanda Adolfina female 14.0000 0 0 350406 7.8542 NaN S NaN NaN NaN
1280 3 0 Vovk, Mr. Janko male 22.0000 0 0 349252 7.8958 NaN S NaN NaN NaN
1281 3 0 Waelens, Mr. Achille male 22.0000 0 0 345767 9.0000 NaN S NaN NaN Antwerp, Belgium / Stanton, OH
1282 3 0 Ware, Mr. Frederick male NaN 0 0 359309 8.0500 NaN S NaN NaN NaN
1283 3 0 Warren, Mr. Charles William male NaN 0 0 C.A. 49867 7.5500 NaN S NaN NaN NaN
1284 3 0 Webber, Mr. James male NaN 0 0 SOTON/OQ 3101316 8.0500 NaN S NaN NaN NaN
1285 3 0 Wenzel, Mr. Linhart male 32.5000 0 0 345775 9.5000 NaN S NaN 298.0 NaN
1286 3 1 Whabee, Mrs. George Joseph (Shawneene Abi-Saab) female 38.0000 0 0 2688 7.2292 NaN C C NaN NaN
1287 3 0 Widegren, Mr. Carl/Charles Peter male 51.0000 0 0 347064 7.7500 NaN S NaN NaN NaN
1288 3 0 Wiklund, Mr. Jakob Alfred male 18.0000 1 0 3101267 6.4958 NaN S NaN 314.0 NaN
1289 3 0 Wiklund, Mr. Karl Johan male 21.0000 1 0 3101266 6.4958 NaN S NaN NaN NaN
1290 3 1 Wilkes, Mrs. James (Ellen Needs) female 47.0000 1 0 363272 7.0000 NaN S NaN NaN NaN
1291 3 0 Willer, Mr. Aaron ("Abi Weller") male NaN 0 0 3410 8.7125 NaN S NaN NaN NaN
1292 3 0 Willey, Mr. Edward male NaN 0 0 S.O./P.P. 751 7.5500 NaN S NaN NaN NaN
1293 3 0 Williams, Mr. Howard Hugh "Harry" male NaN 0 0 A/5 2466 8.0500 NaN S NaN NaN NaN
1294 3 0 Williams, Mr. Leslie male 28.5000 0 0 54636 16.1000 NaN S NaN 14.0 NaN
1295 3 0 Windelov, Mr. Einar male 21.0000 0 0 SOTON/OQ 3101317 7.2500 NaN S NaN NaN NaN
1296 3 0 Wirz, Mr. Albert male 27.0000 0 0 315154 8.6625 NaN S NaN 131.0 NaN
1297 3 0 Wiseman, Mr. Phillippe male NaN 0 0 A/4. 34244 7.2500 NaN S NaN NaN NaN
1298 3 0 Wittevrongel, Mr. Camille male 36.0000 0 0 345771 9.5000 NaN S NaN NaN NaN
1299 3 0 Yasbeck, Mr. Antoni male 27.0000 1 0 2659 14.4542 NaN C C NaN NaN
1300 3 1 Yasbeck, Mrs. Antoni (Selini Alexander) female 15.0000 1 0 2659 14.4542 NaN C NaN NaN NaN
1301 3 0 Youseff, Mr. Gerious male 45.5000 0 0 2628 7.2250 NaN C NaN 312.0 NaN
1302 3 0 Yousif, Mr. Wazli male NaN 0 0 2647 7.2250 NaN C NaN NaN NaN
1303 3 0 Yousseff, Mr. Gerious male NaN 0 0 2627 14.4583 NaN C NaN NaN NaN
1304 3 0 Zabour, Miss. Hileni female 14.5000 1 0 2665 14.4542 NaN C NaN 328.0 NaN
1305 3 0 Zabour, Miss. Thamine female NaN 1 0 2665 14.4542 NaN C NaN NaN NaN
1306 3 0 Zakarian, Mr. Mapriededer male 26.5000 0 0 2656 7.2250 NaN C NaN 304.0 NaN
1307 3 0 Zakarian, Mr. Ortin male 27.0000 0 0 2670 7.2250 NaN C NaN NaN NaN
1308 3 0 Zimmerman, Mr. Leo male 29.0000 0 0 315082 7.8750 NaN S NaN NaN NaN

1309 rows × 14 columns

Next we can list the data types of each field.


In [19]:
titanic.dtypes


Out[19]:
pclass         int64
survived       int64
name          object
sex           object
age          float64
sibsp          int64
parch          int64
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body         float64
home.dest     object
dtype: object

When it comes to the object fields, we can be a bit more precise: name, sex, ticket, cabin, embarked, boat and home.dest are all strings.

Next, we call the describe method to list some statistics on the data. We thus obtain the range of all of the numeric fields of our data.


In [20]:
titanic.describe()


Out[20]:
pclass survived age sibsp parch fare body
count 1309.000000 1309.000000 1046.000000 1309.000000 1309.000000 1308.000000 121.000000
mean 2.294882 0.381971 29.881135 0.498854 0.385027 33.295479 160.809917
std 0.837836 0.486055 14.413500 1.041658 0.865560 51.758668 97.696922
min 1.000000 0.000000 0.166700 0.000000 0.000000 0.000000 1.000000
25% 2.000000 0.000000 21.000000 0.000000 0.000000 7.895800 72.000000
50% 3.000000 0.000000 28.000000 0.000000 0.000000 14.454200 155.000000
75% 3.000000 1.000000 39.000000 1.000000 0.000000 31.275000 256.000000
max 3.000000 1.000000 80.000000 8.000000 9.000000 512.329200 328.000000

Moreover, we can note the ranges of some other fields: sex has only two possible values, female and male, and embarked can only be S, C or Q.

For a better visual result, we decided to replace the travel class, survival and embarkation port codes with more readable values. As we convert these columns to categorical types afterwards, performance is unaffected.


In [21]:
class_dic = {1 : 'First Class', 2 : 'Second Class', 3 : 'Third Class', np.nan : np.nan}
survived_dic = {0 : 'Deceased' , 1 : 'Survived', np.nan : np.nan}
embarked_dic = {'C' : 'Cherbourg', 'Q' : 'Queenstown', 'S' : 'Southampton', np.nan : np.nan}
titanic['pclass'] = titanic['pclass'].apply(lambda x: class_dic[x])

titanic['survived'] = titanic['survived'].apply(lambda x: survived_dic[x])

titanic['embarked'] = titanic['embarked'].apply(lambda x: embarked_dic[x])

Then we convert the categorical columns to the pandas category dtype.


In [22]:
titanic['pclass'] = titanic.pclass.astype('category')
titanic['survived'] = titanic.survived.astype('category')
titanic['sex'] = titanic.sex.astype('category')
titanic['embarked'] = titanic.embarked.astype('category')
titanic['cabin'] = titanic.cabin.astype('category')
titanic['boat'] = titanic.boat.astype('category')
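The repeated conversions can equivalently be written as a single astype call taking a column-to-dtype dict; a sketch on a toy frame:

```python
import pandas as pd

# Toy frame; one astype call converts several columns at once
toy = pd.DataFrame({'pclass': ['First Class'], 'sex': ['female']})
toy = toy.astype({'pclass': 'category', 'sex': 'category'})
print(toy.dtypes.tolist())
```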

2. We plot the histogram of the travel class.


In [23]:
titanic.pclass.value_counts(sort=False).plot(kind='bar')


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4715463828>

Next we plot the histogram of the three embarkation ports.


In [24]:
titanic.embarked.value_counts().plot(kind='bar')


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4715120128>

Next we plot the histogram of the sex.


In [25]:
titanic.sex.value_counts().plot(kind='bar')


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47150d8588>

Next, we cut the age data into decades and plot the histogram of the decades.


In [26]:
pd.cut(titanic.age, range(0,90,10)).value_counts(sort=False).plot(kind='bar')


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4714f826d8>

3. We plot the cabin floor data as a pie chart.


In [27]:
titanic.cabin.dropna().apply(lambda x : x[0]).value_counts(sort=False).plot(kind='pie')


Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4714f1fda0>

4. Here, we plot the proportion of people that survived in the first class.


In [28]:
titanic[titanic.pclass == "First Class"].survived.value_counts(sort=False).plot(kind='pie')


Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4714eee240>

Next, we plot the proportion of people that survived in the second class.


In [29]:
titanic[titanic.pclass == "Second Class"].survived.value_counts(sort=False).plot(kind='pie')


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4714e13e80>

Finally, we plot the proportion of people that survived in the third class.


In [30]:
titanic[titanic.pclass == "Third Class"].survived.value_counts(sort=False).plot(kind='pie')


Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4714d8a908>

As we can see, the lower the class, the higher the probability of death.
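The per-class survival rates behind these pie charts can also be computed in one line: the mean of a boolean indicator is exactly the proportion of survivors. A minimal sketch on toy data (illustrative rows, not the real dataset):

```python
import pandas as pd

# Illustrative rows only, not the real dataset.
toy = pd.DataFrame({
    'pclass':   ['First Class', 'First Class', 'Third Class', 'Third Class'],
    'survived': ['Survived', 'Deceased', 'Deceased', 'Deceased'],
})

# Mean of the boolean indicator gives the survival rate per class.
rate = (toy['survived'] == 'Survived').groupby(toy['pclass']).mean()
print(rate)  # First Class: 0.5, Third Class: 0.0
```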

5. Here we add new columns that will help us calculate the proportions of survivors in the remaining parts.


In [31]:
titanic.insert(0, 'alive', 0)
titanic.insert(0, 'dead', 0)
titanic.insert(0, 'ratio', 0)

Here we set these new columns to the appropriate values. We essentially split the survived column into two indicator columns for easier summing later on. Finally, we slice the data to keep only the columns of interest.


In [32]:
titanic.loc[titanic['survived'] == "Survived", 'alive'] = 1
titanic.loc[titanic['survived'] == "Deceased", 'dead'] = 1
df = titanic[['pclass', 'sex', 'alive', 'dead', 'ratio']]

We group the data by the sex and travel class of the passengers and sum it. This gives us the counts of surviving and deceased people grouped as we wish, from which we can easily calculate the proportion that survived, which we plot as a histogram.


In [33]:
aggregated = df.groupby(['sex', 'pclass']).sum()
(aggregated['alive'] / (aggregated['alive'] + aggregated['dead'])).plot(kind='bar')


Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f471510e5c0>

We can see that there is a huge difference in survival between the classes and sexes: for instance, third-class males had a roughly seven times lower chance of survival than first-class females.
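The helper columns above could also be avoided entirely: pd.crosstab with normalize='index' yields the same proportions in a single call. A minimal sketch on toy data (illustrative rows, not the real dataset):

```python
import pandas as pd

# Illustrative rows only, not the real dataset.
toy = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male'],
    'pclass':   ['First Class'] * 4,
    'survived': ['Survived', 'Survived', 'Survived', 'Deceased'],
})

# One row per (sex, pclass) pair; each row normalized to proportions.
props = pd.crosstab([toy['sex'], toy['pclass']], toy['survived'],
                    normalize='index')
print(props)
```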

6. Next we insert a new column holding the age category of each person. Since we want to split the passengers into two equal groups based on age, we compute the median age. We also drop the passengers with an unknown age, to avoid skewing the median computation.


In [34]:
titanic.dropna(axis=0, subset=['age'], inplace=True)
titanic.insert(0, 'age_category', 0)
median = titanic['age'].median()

Next, we set the correct category for people below or above the median age. People of exactly the median age are grouped with those below it. We then make this column categorical.


In [35]:
titanic.loc[titanic['age'] > median, 'age_category'] = "Age > " + str(median)
titanic.loc[titanic['age'] <= median, 'age_category'] = "Age <= " + str(median)
titanic['age_category'] = titanic.age_category.astype('category')
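As a side note, the two loc assignments above can equivalently be written with a single np.where, which selects between the two labels elementwise. A minimal sketch on a toy age series:

```python
import numpy as np
import pandas as pd

# Toy age series; the real frame's `age` column is assumed numeric.
ages = pd.Series([10.0, 28.0, 40.0])
median = ages.median()  # 28.0

# Elementwise choice: label depends on whether the age exceeds the median.
age_category = np.where(ages > median,
                        "Age > " + str(median),
                        "Age <= " + str(median))
print(age_category.tolist())  # ['Age <= 28.0', 'Age <= 28.0', 'Age > 28.0']
```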

Next, we take the columns of interest and group by age category, sex and travel class. Summing over these groups gives the numbers of people that lived and died, with which we compute the proportion and display it as a DataFrame.


In [36]:
sub = titanic[['pclass', 'sex', 'age_category', 'alive', 'dead', 'ratio']]
subagg = sub.groupby(['age_category', 'sex', 'pclass']).sum()
subagg['ratio'] = (subagg['alive'] / (subagg['alive'] + subagg['dead']))
only_ratio = subagg[['ratio']]
only_ratio


Out[36]:
ratio
age_category sex pclass
Age <= 28.0 female First Class 0.951220
Second Class 0.910714
Third Class 0.513761
male First Class 0.531250
Second Class 0.216216
Third Class 0.183036
Age > 28.0 female First Class 0.967391
Second Class 0.872340
Third Class 0.372093
male First Class 0.302521
Second Class 0.083333
Third Class 0.144000

As before, we can see that there is a huge difference in survival between the classes and sexes. On the other hand, age makes little difference: whether a passenger is above or below 28 years old, their probability of survival is determined far more by their sex and class.